The effects of quantization (i.e., roundoff, truncation, etc.) and adder
overflow, which are present in any special-purpose computer type realiza- 
tion of a digital filter, cause an otherwise linear system to become quite 
nonlinear. Moreover, the presence of such nonlinearities can cause the 
system's response to differ drastically from the ideal response (that is, 
from the response of the linear model of the filter) even when the level of the 
filter's input signal is, in a certain reasonable sense, small, and when the 
quantization effects are made arbitrarily small. 

In this paper we derive a criterion for the satisfactory behavior of second- 
order digital filters in the presence of such nonlinear effects. The criterion 
is shown to be sharp, in that we also present a procedure for constructing
counterexamples which show that, for most filters which violate the criterion,
the response to some "small" nonzero input signal is not always even
asymptotically close to the ideal response.

I. INTRODUCTION 

The effects of quantization (i.e., roundoff, truncation, etc.) and adder 
overflow are present in any special-purpose computer type realization 
of a digital filter. When taken into account, these effects cause an 
otherwise linear system to become quite nonlinear. To date, the analysis 
of limit cycle phenomena in such nonlinear digital filters has been 
concerned with the study of the zero-input response of second-order 
filters (Refs. 1-3). A more fundamental problem is that of determining whether
or not a filter's response to a nonzero input (the forced response) is in 
some meaningful sense close to the ideal response. This problem seems 
to have been ignored. 

If we consider input sequences, the levels of which are sufficiently 
small (in the sense that when the input sequence is applied to the linear 
model of the filter, the response eventually lies within the open interval
determined by the most positive and the most negative machine 
numbers), then it is tempting to conjecture, as if the system were 
linear, that when the filter's zero-input response can be made to admit 
only limit cycles of small amplitude by using sufficiently many bits in 
the representation of the data so that the quantization errors are made 
sufficiently small, then the deviation of the filter's forced response 
from the ideal can also be made small in the same manner. As will be 
shown by counterexamples, however, this conjecture is false. Thus, 
since the usual purpose of a digital filter is the processing of nonzero 
signals, a question of major importance becomes: How can it be deter- 
mined that, in the presence of quantization and adder overflow, a 
digital filter's forced response will be satisfactory? 

In this paper we analyze the forced response of second-order digital 
filters which employ a type of arithmetic that has been called saturation 
arithmetic.* The essential structure of a second-order digital filter is 
shown in Fig. 1 where, for given real numbers $a$, $b$ the filter's output
sequence† $v^{(k)}$, $k = 1, 2, \ldots$, is uniquely determined by the input
sequence $u^{(k)}$, $k = 1, 2, \ldots$, and by $v^{(-1)}$, $v^{(0)}$, the initial values of the
filter's state variables. We develop a criterion by which satisfactory
behavior of the filter can be determined. The criterion is shown to be 
sharp, in the sense that our counterexamples show that for most filters 
which violate the criterion, the forced response is not always close to 
the ideal response. 

More precisely, we show that when the filter's coefficients a, b are 
determined by any point lying within the open crosshatched region 
of Fig. 2, and for any input sequence whose level is small (in the sense 
mentioned earlier), then the response of the nonlinear filter will be 
asymptotically close to the ideal response. On the other hand, we show 
that when the filter's coefficients are determined by any point lying 
within the shaded regions in the lower corners of the triangle of Fig. 2, 
and when certain very reasonable assumptions are satisfied concerning 
the nature of the quantization, then there exist input sequences the 
levels of which are also small, but for which the filter's response is not 
asymptotically close to the ideal response. 



* The definition of this term is given in Section II. 

† In many applications some linear combination of the quantities $v^{(k)}$, $v^{(k-1)}$,
$v^{(k-2)}$ is taken to be the filter's output at the kth time instant. This additional
complication has no bearing on the matters considered here. For simplicity, therefore,
we consider the sequence $v^{(k)}$ to be the filter's output.
Fig. 1. Second-order digital filter.



II. SECOND-ORDER FILTERS 

The usual method of designing digital filters (Ref. 4) employs the
interconnection of many second-order filters. The analysis and design of
second-order digital filters is therefore a problem of considerable 
practical importance. 

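The behavior of the digital filter of Fig. 1 is characterized by the
linear difference equation

$$w^{(k+1)} = A w^{(k)} + \begin{bmatrix} 0 \\ u^{(k+1)} \end{bmatrix}, \qquad k = 0, 1, 2, \ldots, \tag{1a}$$
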
Fig. 2. Region determining filter coefficients for which the filter's forced response
can be made asymptotically close to the ideal response (crosshatched region), and
region determining coefficients for which the forced response will not always be
close to the ideal response (shaded region).






THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1972 



where $A$ denotes the $2 \times 2$ matrix

$$A = \begin{bmatrix} 0 & 1 \\ b & a \end{bmatrix} \tag{1b}$$

and $w^{(k)}$ is a two-dimensional vector (specifying the state of the system
at the kth time instant) the second component of which, $w_2^{(k)}$,
corresponds to the digital filter's output sequence $v^{(k)}$.

In any special-purpose computer type realization of the digital filter 
of Fig. 1 the ideal behavior specified by (1) can be only approximated. 
At each time instant, the output of the summation point can assume 
only one of a finite number of values. Therefore, the actual value of 
the summation point's output is given by an expression such as 

$$v^{(k)} = f\bigl(a v^{(k-1)} + b v^{(k-2)} + u^{(k)}\bigr) + e^{(k)},$$

where the function $f$ accounts for adder overflow and the sequence $e^{(k)}$
accounts for the quantization error that is inherently present. The
equality $f(\xi) = \xi$ is satisfied only in a certain neighborhood of the origin
which we take to be the interval $-1 \le \xi \le 1$. We consider filters
employing saturation arithmetic; that is, we define $f(\xi) = -1$ for $\xi < -1$
and $f(\xi) = 1$ for $\xi > 1$.
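
The summation-point update just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `saturate` and `filter_step` are hypothetical, and the quantization error is passed in as a plain number `e` rather than produced by any particular rounding rule.

```python
def saturate(xi: float) -> float:
    """Saturation characteristic f: identity on [-1, 1], clipped to -1 or 1 outside."""
    return max(-1.0, min(1.0, xi))


def filter_step(a: float, b: float, v_prev1: float, v_prev2: float,
                u: float, e: float = 0.0) -> float:
    """One update v(k) = f(a*v(k-1) + b*v(k-2) + u(k)) + e(k).

    v_prev1 is v(k-1), v_prev2 is v(k-2); e models the quantization error.
    """
    return saturate(a * v_prev1 + b * v_prev2 + u) + e
```

With `e = 0` and an ideal sum inside $[-1, 1]$ the step reduces to the linear recursion; the nonlinearity only acts when the ideal sum leaves that interval.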

When the effects of quantization and adder overflow are taken into 
consideration, the digital filter of Fig. 1 is then characterized by the 
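nonlinear difference equation

$$r^{(k+1)} = F\left[A r^{(k)} + \begin{bmatrix} 0 \\ u^{(k+1)} \end{bmatrix}\right] + \begin{bmatrix} 0 \\ e^{(k+1)} \end{bmatrix}, \qquad k = 0, 1, 2, \ldots, \tag{2}$$
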
where the state of the system at the kth time instant is now specified
by the two-dimensional vector $r^{(k)}$. The mapping $F$ is defined by the
relation

$$F\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ f(y) \end{bmatrix}. \tag{3}$$



Since the purpose of our study is to examine the effects of quantization 
and adder overflow on the forced response of digital filters, we are 
interested in comparing the solutions of (1) and (2) when the equations 
are given identical input sequences and identical initial conditions. We 
make the reasonable assumption that we are concerned only with digital 
filters whose linear model, i.e., eq. (1), is asymptotically stable. It is 
well known that (1) characterizes an asymptotically stable linear 
system if and only if each eigenvalue of the matrix $A$ has magnitude
less than unity. The eigenvalues of $A$ (the roots of the polynomial
$\lambda^2 - a\lambda - b$) are known to have magnitude less than unity if and only
if the coefficients $a$, $b$ have values determined by points that lie within
the large open triangular region shown in Fig. 2 (determined by the
straight lines: $b \pm a = 1$, $b = -1$).
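
The equivalence between the triangle test and the eigenvalue test can be spot-checked numerically. The following is a minimal sketch; the helper names are hypothetical, and the eigenvalues are obtained from the quadratic formula applied to $\lambda^2 - a\lambda - b$.

```python
import cmath


def in_stability_triangle(a: float, b: float) -> bool:
    """True if (a, b) lies in the open triangle bounded by b + a = 1, b - a = 1, b = -1."""
    return (b + a < 1.0) and (b - a < 1.0) and (b > -1.0)


def spectral_radius(a: float, b: float) -> float:
    """Largest magnitude among the roots of lambda^2 - a*lambda - b."""
    disc = cmath.sqrt(a * a + 4.0 * b)
    return max(abs((a + disc) / 2.0), abs((a - disc) / 2.0))
```

For instance, the coefficient pair $a = -1.3$, $b = -0.9$ used later in the paper passes both tests, while $a = 0.5$, $b = 0.6$ fails both.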

It is clear that so long as the filter's input sequence is such that the
solution of the linear equation (1) is continually being driven into the
region† $\|w^{(k)}\| > 1$, then there is little point in trying to compare
the solutions of (1) and (2), it being clear at the start that at each such
time instant they will differ by at least the amount by which $\|w^{(k)}\|$
exceeds unity (plus or minus the quantization error $e^{(k)}$ which,
presumably, will be small). At the other extreme, if it is known in advance
that the initial conditions and the input sequence are small enough
in magnitude that the solution of the linear equation (1) is within
the range $\|w^{(k)}\| \le 1 - \delta$, for some $\delta > 0$, and for all $k = 1, 2, \ldots$,
then there is no problem. That is, it is clear at the outset (due to the
assumption that the linear system is asymptotically stable) that the
solutions of (1) and (2) will be made arbitrarily close for all such inputs,
by simply causing the magnitude of the quantization error $e^{(k)}$ to be
bounded by a sufficiently small number. In effect, the nonlinear function
$f$ is then not present; we are simply comparing the responses of the
same stable linear system to two slightly different inputs.

The interesting question which we shall consider is the following.
Suppose we assume only that the filter's input is such that the
ideal response, the solution of the linear system (1), eventually (i.e., for
all $k$ sufficiently large) satisfies $\|w^{(k)}\| \le 1 - \delta$, for some $\delta > 0$.‡ Then,
when is the same thing (i.e., $\|r^{(k)}\| \le 1 - \delta$ for some $\delta > 0$, and all $k$
sufficiently large) true for the solution of eq. (2)? Thus, we are interested
in knowing when the gross effects of the nonlinearity are simply of a
transient nature and hence, aside from such transient effects, when
the filter's response can be made as close to the ideal as desired by simply
causing the quantization error to be sufficiently small (i.e., by using
a sufficient number of bits in the representation of the data).
Unfortunately, as our counterexamples will show, it is not always the case



† For each $w = (w_1, w_2)^T$ we define $\|w\| = \max\{|w_1|, |w_2|\}$.

‡ The inequality $\|w^{(k)}\| \le 1$ might seem more reasonable here. The necessity to
write $1 - \delta$ on the right-hand side is the small price that we must pay for the freedom
to treat the quantization error in the relatively simple manner that we have chosen.
By considering the quantization error at each step to be simply a "small" input
$e^{(k)}$, we do not admit to the knowledge that, for example, in all sufficiently small
neighborhoods of the points $\xi = \pm 1$, the quantization (be it roundoff, truncation,
or whatever) will be done in such a manner that $|\xi + e^{(k)}| \le 1$.
that this will occur in the nonlinear system whenever the linear system's
response satisfies $\|w^{(k)}\| \le 1 - \delta$ for some $\delta > 0$ and all sufficiently
large $k$.

With our objective thus being to compare the asymptotic behavior 
of the solutions of (1) and (2), and since the linear system (1) is assumed 
to be asymptotically stable, it is clear that we may drop the requirement 
that the equations have the same initial conditions. This follows, 
of course, from the fact that the initial conditions of (1) do not affect 
the solution's asymptotic behavior. 

By including the quantization effects in the linear model of the filter, 
the system is then described by the equation 



$$s^{(k+1)} = A s^{(k)} + \begin{bmatrix} 0 \\ u^{(k+1)} + e^{(k+1)} \end{bmatrix}, \qquad k = 0, 1, 2, \ldots, \tag{4}$$

whose solution can be made arbitrarily close to the solution of (1) by
simply requiring that all $|e^{(k)}|$ be sufficiently small. Let us assume,
therefore, that the $|e^{(k)}|$ are at least small enough that there exists
$\delta' > 0$ and a nonnegative integer $K$ such that, for all nonnegative
integers $k \ge K$,



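Letting

$$z^{(k)} = r^{(k)} - s^{(k)}, \qquad k = 0, 1, 2, \ldots, \tag{6}$$

we find, from (2) and (4), that the sequence $z^{(k)}$ is determined by the
equation

$$z^{(k+1)} = F\left[A z^{(k)} + \begin{bmatrix} 0 \\ \nu^{(k+1)} \end{bmatrix}\right] - \begin{bmatrix} 0 \\ \nu^{(k+1)} \end{bmatrix}, \qquad k = 0, 1, 2, \ldots, \tag{7}$$

where $z^{(0)} = r^{(0)} - s^{(0)}$ and, for $k = 0, 1, 2, \ldots$,
$\nu^{(k+1)} = b s_1^{(k)} + a s_2^{(k)} + u^{(k+1)}$,
which, according to (5), with $\epsilon = 1 - \delta'$, satisfies

$$|\nu^{(k+1)}| < \epsilon, \qquad \text{for } k \ge K. \tag{8}$$

We take as our objective, therefore: To determine when, for any
sequence $\nu^{(k+1)}$, $k = 0, 1, 2, \ldots$, satisfying (8) for some $\epsilon$ in the interval
$0 \le \epsilon < 1$, and some nonnegative integer $K$, the solution of (7) satisfies
$\lim_{k \to \infty} \|z^{(k)}\| = 0$.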

We note at this point that our objective stated in the preceding 
paragraph is similar to the objective in Ref. 3 (see the paragraph 
immediately following eq. (8) of that paper) where the control of 
limit cycles in the zero-input response of second-order digital filters is
considered. The important difference between the two objectives is
that here we must accommodate any value of $\epsilon$ in the interval $[0, 1)$.
In Ref. 3, however, it was only necessary to consider bounds on the
sequence $\nu^{(k+1)}$ that were "sufficiently small". The consequences of
this difference are great. It will be clear that a much more delicate
analysis is required here than that in Ref. 3.

III. ANALYSIS OF THE FORCED RESPONSE 

We now determine, in accordance with the objective explained in 
Section II, a criterion for the satisfactory behavior of the forced response 
of second-order digital filters in the presence of quantization and adder 
overflow. We consider filters employing saturation arithmetic; that is, 
we define the function $f$ of Section II by

$$f(\xi) = \begin{cases} -1 & \text{for } \xi < -1, \\ \xi & \text{for } -1 \le \xi \le 1, \\ 1 & \text{for } \xi > 1. \end{cases} \tag{9}$$

The following theorem is fundamental to our analysis. 

Theorem 1: Let the matrix $A$ be defined by (1b) in which the values of
$a$, $b$ are specified by some point lying within the open triangular region
of Fig. 2 (determined by the straight lines: $b \pm a = 1$, $b = -1$). Let the
mapping $F$ be defined by (3) in which the function $f$ is specified by (9).
Then, for any sequence $\nu^{(k+1)}$, $k = 0, 1, 2, \ldots$, satisfying (8) for some $\epsilon$
in the interval $0 \le \epsilon < 1$ and some nonnegative integer $K$, the solution of
(7) satisfies $\lim_{k \to \infty} \|z^{(k)}\| = 0$ provided that there exists a real number $\sigma$
such that

$$1 - \sigma^2 a^2 > 0, \tag{10}$$

$$[1 - b^2 - (1 - \sigma)a^2]^2 - a^2[\sigma + (2 - \sigma)b]^2 > 0, \tag{11}$$

and

$$\sqrt{1 - \sigma^2 a^2} + \sqrt{[1 - b^2 - (1 - \sigma)a^2]^2 - a^2[\sigma + (2 - \sigma)b]^2} > |b^2 - (1 - \sigma)a^2|. \tag{12}$$

The proof of Theorem 1 is given in the Appendix. We now seek to
determine those points lying within the triangular region of Fig. 2
which specify values of $a$, $b$ such that the inequalities (10), (11), (12)
are satisfied for some value of $\sigma$.
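
The three inequalities of Theorem 1 are easy to evaluate numerically for a given coefficient pair. The sketch below is illustrative only: the function names are hypothetical, and the search over $\sigma$ is restricted to a uniform grid on $[0, 1]$, which is an assumption of this example (a coarse grid could in principle miss a narrow admissible range of $\sigma$).

```python
import math


def theorem1_holds(a: float, b: float, sigma: float) -> bool:
    """Check inequalities (10), (11), (12) for one value of sigma."""
    c10 = 1.0 - sigma ** 2 * a ** 2
    c11 = (1.0 - b ** 2 - (1.0 - sigma) * a ** 2) ** 2 \
        - a ** 2 * (sigma + (2.0 - sigma) * b) ** 2
    if c10 <= 0.0 or c11 <= 0.0:
        return False  # (10) or (11) fails, so the square roots in (12) are not usable
    lhs = math.sqrt(c10) + math.sqrt(c11)
    return lhs > abs(b ** 2 - (1.0 - sigma) * a ** 2)


def in_theorem1_region(a: float, b: float, steps: int = 200) -> bool:
    """True if some sigma on a uniform grid over [0, 1] satisfies (10)-(12)."""
    return any(theorem1_holds(a, b, i / steps) for i in range(steps + 1))
```

For example, `in_theorem1_region(0.5, 0.2)` succeeds already at $\sigma = 0$, whereas the pair $a = -1.3$, $b = -0.9$ fails (11) or (10) for every $\sigma$ on the grid.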
We begin by examining the case in which $\sigma = 0$. In this case (10) is
satisfied for all $a$, $b$ and, as shown in the Appendix, (11) is satisfied
for only those values of $a$, $b$ specified by points lying within the open
crosshatched region of Fig. 5 which, for $\sigma = 0$, is the open crosshatched
region of Fig. 3. By squaring each side of the inequality (12), it is
easily shown that that inequality, with $\sigma = 0$, is equivalent to

$$(1 - a^2 - b^2) + \sqrt{(1 - a^2 - b^2)^2 - 4a^2 b^2} > 0.$$

Since values of $a$, $b$ specified by points lying within the crosshatched
region of Fig. 3 satisfy $1 - a^2 - b^2 > 0$, it is clear that all such values
(and only those values) of $a$, $b$ satisfy (10), (11), and (12) for $\sigma = 0$.

For negative values of $\sigma$ and for $\sigma \ge 2$ it is clear that the crosshatched
region of Fig. 5 lies interior to the crosshatched region of Fig. 3. Thus,
consideration of such values of $\sigma$ can determine no values of $a$, $b$ that are
not already determined in Fig. 3 by consideration of the $\sigma = 0$ case.

We now show that values of $\sigma$ in the interval $1 \le \sigma < 2$ yield no
values of $a$, $b$ satisfying (10), (11), and (12) that cannot also be
determined by considering some value of $\sigma$ in the interval $0 < \sigma \le 1$. Let $\hat\sigma$
satisfy $1 \le \hat\sigma < 2$ and then define $\tilde\sigma = 2 - \hat\sigma$. Clearly $0 < \tilde\sigma \le 1$.
Now, if (10) is satisfied for $\sigma = \hat\sigma$, then, clearly, (10) is also satisfied
for $\sigma = \tilde\sigma$. The expression on the left-hand side of (11) can be rewritten
as $(1 - b^2)^2 + a^2\{[a^2 - 2(1 + b^2)] - [a^2 - (1 - b)^2]\sigma(2 - \sigma)\}$. The
form of this expression shows that it has the same value for $\sigma = \hat\sigma$ and
$\sigma = \tilde\sigma$. Finally, it is clear that if (12) is satisfied for $\sigma = \hat\sigma$, then (12) is
also satisfied for $\sigma = \tilde\sigma$ since [using our observations regarding (10)
and (11)] the left-hand side of that inequality is not decreased by



Fig. 3. Region in which inequality (11) is satisfied for $\sigma = 0$ (the figure labels the circle $a^2 + b^2 = 1$).
replacing $\hat\sigma$ with $\tilde\sigma$, and since

$$|b^2 - (1 - \hat\sigma)a^2| = |b^2 + (1 - \tilde\sigma)a^2| \ge |b^2 - (1 - \tilde\sigma)a^2|.$$

There remains to consider only those values of $\sigma$ in the interval
$0 < \sigma \le 1$. Thus, for each such value of $\sigma$ we wish to determine the
values of the parameters $a$, $b$ specified by points lying within the open
crosshatched region of Fig. 5 and, from (10), within the open region
specified by $|a| < 1/\sigma$, for which the inequality (12) is satisfied. It is
not difficult to show that for each value of $\sigma$ satisfying $0 < \sigma \le 1$ the
function

$$\varphi_\sigma(a, b) = \sqrt{1 - \sigma^2 a^2} + \sqrt{[1 - b^2 - (1 - \sigma)a^2]^2 - a^2[\sigma + (2 - \sigma)b]^2} - |b^2 - (1 - \sigma)a^2|,$$

whose domain is that portion of the open crosshatched region of Fig. 5
where $|a| < 1/\sigma$, is monotone decreasing in $|a|$ for each value of $b$
in the interval $-1 < b \le 0$. Thus, the region in which the inequality
(12) is satisfied is easily located by determining the curves $\varphi_\sigma(a, b) = 0$.
Moreover, because of the above observation concerning the monotonicity
of $\varphi_\sigma$, it is easy to determine these curves numerically. Several such
curves, for various values of $\sigma$, are shown in Fig. 4.




Fig. 4. Location of the $\varphi_\sigma(a, b) = 0$ curves for several values of $\sigma$.
The region in which the inequalities (10), (11), (12) are satisfied for
some real number $\sigma$ is the union of the regions determined by the
$\varphi_\sigma(a, b) = 0$ curves for $0 \le \sigma \le 1$. The numerical results show that
this region has the shape indicated by the crosshatched area in Fig. 2.
The boundary of this region appears to be determined by the straight
lines $b \pm a = 1$ for $b \ge -\tfrac{1}{3}$, and by the ellipse $a^2 + 8b(1 + b) = 0$
for $b \le -\tfrac{1}{3}$.
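
A short worked check, offered as a sketch rather than as part of the original argument: taking the ellipse to be $a^2 + 8b(1 + b) = 0$ and the left-hand edge of the triangle to be $b - a = 1$, substitution shows that the two curves meet in a single (double) point, so the straight and elliptical pieces of the boundary join tangentially there. By the symmetry $a \to -a$ the same holds for the edge $b + a = 1$.

```latex
a = b - 1 \;\Longrightarrow\; (b - 1)^2 + 8b(1 + b) = 9b^2 + 6b + 1 = (3b + 1)^2 = 0
\;\Longrightarrow\; b = -\tfrac{1}{3}, \quad a = -\tfrac{4}{3}.
```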

It is clear that there are several ways in which our analysis could be 
refined in order to provide the possibility of improving upon the result 
of Theorem 1. In the next section, however, we show a fundamental 
limitation on the extent of any such improvement. We show there, how 
to construct counterexamples which demonstrate that for a certain 
large portion of the uncrosshatched area of the triangular region of 
Fig. 2 (in particular, the shaded areas in each lower corner) the con- 
clusion of Theorem 1 is, in fact, false. 

IV. COUNTEREXAMPLES 

We now show how to construct the counterexamples which have 
been referred to in the preceding sections. We begin by showing that, 
when the function $f$ is defined by (9), and when the values of the filter's
coefficients are determined by any point lying within the open shaded
regions in each lower corner of the triangle shown in Fig. 2, then there
exist nonzero initial conditions and, for some $\epsilon < 1$, a periodic input
sequence $\nu^{(k+1)}$ satisfying (8) such that the solution of (7) is periodic
(and thus does not satisfy $\lim_{k \to \infty} \|z^{(k)}\| = 0$).

We first consider values of a, b determined by any point lying within 
the shaded open region in the lower left corner of the triangle of Fig. 2. 
In particular, we assume that 

ab > 1. (13) 

It is also clear that the inequality 

a < -b 2 (14) 

holds for any such point. Denoting the initial condition $z^{(0)}$ by $z^{(0)} =
(x^{(0)}, y^{(0)})^T = (p, q)^T$, and considering an input sequence having period
three, specified by $\nu^{(1)} = 0$, $\nu^{(2)} = 1 - p$, $\nu^{(3)} = -1 - q$, it is clear
from Table I that (7) has a nonzero periodic solution provided that
values of $p$, $q$ can be found such that the inequalities specified in
parentheses in Table I are satisfied.

The inequalities in the $\nu^{(k+1)}$ column [which must hold in order that
the input sequence satisfy (8) for some $\epsilon < 1$] and the inequalities on
the first line of the column labeled "$b x^{(k)} + a y^{(k)} + \nu^{(k+1)}$" are
equivalent to:

$$0 < p < 2, \qquad -2 < q < 0, \qquad -1 < bp + aq < 1. \tag{15}$$



Thus, so long as we consider only positive values of p and negative 
values of q, these inequalities will always hold whenever p and q have 
sufficiently small magnitude. The remaining two inequalities specified 
in Table I will be satisfied provided that

$$(1 - ab)p - (a^2 + b)q < 0$$

and

$$(a + b^2)p - (1 - ab)q < 0. \tag{16}$$

In view of the inequalities (13), (14), it is clear that there exist values
of $p > 0$ and $q < 0$ such that (16) is satisfied. Moreover, it is clear
that the magnitudes of $p$ and $q$ can be scaled such that the inequalities
(15) are also satisfied. Thus, there exists a nonzero periodic (of period
three) solution of (7).
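
The period-three construction can be checked by direct simulation. The sketch below assumes that equation (7) has the form $z^{(k+1)} = F[Az^{(k)} + (0, \nu^{(k+1)})^T] - (0, \nu^{(k+1)})^T$ with $A$ as in (1b); the coefficient pair $a = -1.3$, $b = -0.9$ is borrowed from the paper's later numerical example, while $p = 0.6$, $q = -0.1$ are hypothetical values picked here only because they satisfy (15) and (16).

```python
def saturate(x: float) -> float:
    """Saturation characteristic f of eq. (9)."""
    return max(-1.0, min(1.0, x))


def z_step(a, b, z, nu):
    """One step of the error recursion: (x, y) -> (y, f(b*x + a*y + nu) - nu)."""
    x, y = z
    return (y, saturate(b * x + a * y + nu) - nu)


def run(a, b, z0, nus, periods):
    """Iterate z_step over `periods` repetitions of the periodic sequence nus."""
    z, orbit = z0, [z0]
    for _ in range(periods):
        for nu in nus:
            z = z_step(a, b, z, nu)
            orbit.append(z)
    return orbit


a, b = -1.3, -0.9                 # point with ab > 1 and a < -b^2
p, q = 0.6, -0.1                  # hypothetical choice satisfying (15), (16)
nus = (0.0, 1.0 - p, -1.0 - q)    # nu(1), nu(2), nu(3); all of magnitude < 1
orbit = run(a, b, (p, q), nus, periods=5)
```

Every third entry of `orbit` returns to $(p, q)$ up to rounding, so $\|z^{(k)}\|$ does not decay even though each $|\nu^{(k+1)}|$ is below one. Replacing `saturate` by the identity turns the recursion into $z^{(k+1)} = Az^{(k)}$, which does decay for these coefficients.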

For values of a, b determined by any point lying within the open 
shaded region in the lower right corner of the triangle of Fig. 2 a similar 
line of reasoning shows that a nonzero solution of period six can be 
obtained. The existence of such a solution is easy to demonstrate by 
noting the odd-symmetry of the function $f$ and, with the initial condition
$z^{(0)} = (p, q)^T$, showing that, with $\nu^{(1)} = 0$, $\nu^{(2)} = 1 + p$, $\nu^{(3)} = -1 + q$,
there exist values of $p$, $q$ such that $z^{(3)} = (-p, -q)^T$. We omit the
details.

The above procedure for constructing counterexamples is concerned 
explicitly with the solutions of (7). The simple relationships between (7) 
and the original equations of interest, i.e., eqs. (1) and (2), are described 



Table I. Construction of a Periodic Solution for Equation (7)

in Section II. It is instructive, however, to consider explicitly the
implications of such counterexamples concerning the solutions of (1) 
and (2). 

Let any values of the parameters a, b, determined by some point 
lying within the open shaded regions in the lower corners of the triangle 
of Fig. 2, be given. Consider any counterexample constructed according 
to the above procedure. Then, assuming that the quantization occurs 
in an appropriate manner (or, assuming that there is no quantization) 
it is a straightforward matter to use the relationships between the 
variables of (1), (2), and (7) to demonstrate a periodic input sequence
$u^{(k)}$ and appropriate values of the initial conditions $v^{(-1)}$, $v^{(0)}$ such that
the response of the linear model of the filter of Fig. 1 [i.e., $w^{(k)}$, the
solution of (1)] is asymptotic to a periodic sequence, and satisfies
$\|w^{(k)}\| < 1$ for all sufficiently large $k$, while the response of the
nonlinear filter [i.e., $r^{(k)}$, the solution of (2)], although also periodic, is such
that $\|w^{(k)} - r^{(k)}\|$ does not approach zero as $k \to \infty$.

These counterexamples, while clearly demonstrating that there exists 
potential trouble whenever a filter's coefficients are assigned such "bad" 
values, do not show that such behavior will necessarily be possible 
for some particular filter. They do not demonstrate, for example, that 
with a particular (specified) kind of quantization, and with a particular 
set of permissible values for the filter's input sequence, there will 
necessarily exist a periodic input sequence for which the linear, and 
the nonlinear digital filters have asymptotically different responses. It 
is possible, however, by considering at the outset the details of the 
quantization and thereby imposing somewhat different constraints 
(to those of Table I) on the values chosen for $p$, $q$, $\nu^{(1)}$, to construct
certain counterexamples which show just that. 

We assume that the values specified for the parameters a, b are 
determined by a point lying within the open shaded region in the lower 
left corner of the triangle of Fig. 2. (A similar development could, of 
course, be considered for the other shaded region.) We also assume 
that a certain finite set $Q$ of allowable machine numbers, satisfying
$x \in Q \Rightarrow |x| \le 1$, is specified. Thus, we assume that for the nonlinear
digital filter with quantization the variables $u^{(k)}$, $v^{(k)}$, $v^{(k-1)}$, $v^{(k-2)}$ of
Fig. 1 can assume only those values specified by the set $Q$. Furthermore,
we assume that the filter employs saturation arithmetic with the
overflow and quantization effects both specified by a certain function $f_Q$;
that is, given any values for $u^{(k)}$, $v^{(k-1)}$, $v^{(k-2)}$ taken from the set $Q$,
the value for $v^{(k)}$ appearing at the output of the summation point in
Fig. 1 is specified by

$$v^{(k)} = f_Q\bigl(a v^{(k-1)} + b v^{(k-2)} + u^{(k)}\bigr). \tag{17}$$

If, for example, with $a = -1.3$ and $b = -0.9$, the values $u^{(k)} = 0.0$,
$v^{(k-1)} = -0.9$, $v^{(k-2)} = 1.0$ are considered; and if the quantization is
accomplished by simply rounding the ideal output of the summation
point to the nearest tenth, the value $v^{(k)}$ specified by (17) is $v^{(k)} = 0.3$.
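
This small example can be reproduced with a hypothetical implementation of $f_Q$. The sketch assumes rounding to the nearest tenth followed by clipping to $[-1, 1]$; the name `f_q` and the default clipping limits are choices made for illustration.

```python
def f_q(x: float, r_min: float = -1.0, r_max: float = 1.0) -> float:
    """Round the ideal sum to the nearest tenth, then saturate to [r_min, r_max]."""
    return max(r_min, min(r_max, round(x * 10.0) / 10.0))


a, b = -1.3, -0.9
u_k, v_km1, v_km2 = 0.0, -0.9, 1.0
ideal_sum = a * v_km1 + b * v_km2 + u_k   # about 0.27 before quantization
v_k = f_q(ideal_sum)
```

Here the ideal sum is about 0.27, which rounds to the value 0.3 quoted in the text; an ideal sum outside $[-1, 1]$ would instead be clipped to the nearest limit.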

Clearly, if the set Q imposes sufficiently severe (indeed, for practical 
purposes, unreasonable) restrictions on the values that the input sequence 
and the initial conditions may assume, then it will be impossible to 
construct a counterexample. It is no surprise, therefore, that the success 
of the process to be described depends upon the assumption that the 
quantization is "sufficiently fine" (that is, that there are sufficient 
quantization levels distributed throughout the interval $[-1, 1]$), and
that when $|a v^{(k-1)} + b v^{(k-2)} + u^{(k)}| \le 1$, the actual output of the
summation point is reasonably close to the ideal value, that is,

$$f_Q\bigl(a v^{(k-1)} + b v^{(k-2)} + u^{(k)}\bigr) \approx a v^{(k-1)} + b v^{(k-2)} + u^{(k)}. \tag{18}$$

We first note that the values of $a$, $b$ determined by any point lying
within the open shaded region in the lower left corner of the triangle of
Fig. 2 are such that $-1 < a - b < 1$. Thus, since $b/a > 0$, we also have
$-1 < a - b + b/a$ and $a - b < 1$. This ensures that the open intervals
$(-1, 1)$ and $(a - b, a - b + b/a)$ overlap. Hence, if the quantization is
sufficiently fine, there exists $u^{(1)} \in Q$ such that

$$-a/b < 0 < b - a + u^{(1)} < b/a < 1. \tag{19}$$

Thus, for such a value of $u^{(1)}$,

$$b - a(b - a + u^{(1)}) < 0$$

and

$$a + b(b - a + u^{(1)}) < 0.$$

Hence, for sufficiently fine quantization, there exist $u^{(2)}, u^{(3)} \in Q$ such that

$$1 + b - a(b - a + u^{(1)}) - u^{(2)} < 0, \tag{20}$$

and

$$1 + a + b(b - a + u^{(1)}) + u^{(3)} < 0. \tag{21}$$

We let $r_{\max}$ denote the most positive value in the set $Q$ and let $r_{\min}$
denote the most negative value in $Q$. We also let

$$r_2^{(1)} = f_Q(a r_{\min} + b r_{\max} + u^{(1)}), \qquad r_2^{(2)} = r_{\max}, \qquad r_2^{(3)} = r_{\min}.$$
Now, assuming that (18) holds, it can be expected, due to (20) and
(21), that there exist $p$, $q$ such that

$$(1 - ab)p - (a^2 + b)q = 1 - b r_{\min} - a r_2^{(1)} - u^{(2)} < 0, \tag{22}$$

$$(a + b^2)p - (1 - ab)q = 1 + a r_{\max} + b r_2^{(1)} + u^{(3)} < 0. \tag{23}$$

Moreover, if the quantization is sufficiently fine, the values for $u^{(2)}$ and
$u^{(3)}$ can be chosen such that $1 - b r_{\min} - a r_2^{(1)} - u^{(2)} \approx 0$ and
$1 + a r_{\max} + b r_2^{(1)} + u^{(3)} \approx 0$, and such that the values of these expressions
are in the proper ratio that, in fact, small values of $p > 0$ and $q < 0$
are determined by the equations in (22), (23). Thus, since for sufficiently
fine quantization

$$a r_{\min} + b r_{\max} + u^{(1)} \approx b - a + u^{(1)}, \tag{24}$$

and, due to (19), it is reasonable to expect that there exists $\nu^{(1)}$ such that

$$-1 < bp + aq + \nu^{(1)} = a r_{\min} + b r_{\max} + u^{(1)} < 1. \tag{25}$$

Furthermore, for $p > 0$, $q < 0$ small, we expect that the following
inequalities also hold:

$$-1 < r_2^{(1)} - bp - aq < 1, \tag{26}$$

$$-1 < r_{\max} - p < 1, \tag{27}$$

$$-1 < r_{\min} - q < 1. \tag{28}$$

Assuming therefore that the values of $u^{(1)}$, $u^{(2)}$, $u^{(3)}$, $p$, $q$, $r_2^{(1)}$ are
such that (22), (23), (25), (26), (27), and (28) hold, we proceed with
the construction of a counterexample by simply assigning the values to
the remaining variables that are dictated by the relationships specified
in Section II. In particular, we let

$$s_2^{(2)} = r_{\max} - p, \qquad e^{(2)} = -1 + r_{\max}, \qquad \nu^{(2)} = 1 - p,$$

$$s_2^{(3)} = r_{\min} - q, \qquad e^{(3)} = 1 + r_{\min}, \qquad \nu^{(3)} = -1 - q,$$

and

$$r_1^{(1)} = r_2^{(3)}, \qquad s_1^{(1)} = s_2^{(3)},$$

$$r_1^{(2)} = r_2^{(1)}, \qquad s_1^{(2)} = s_2^{(1)},$$

$$r_1^{(3)} = r_2^{(2)}, \qquad s_1^{(3)} = s_2^{(2)}.$$
At this point, one final step remains in our construction of a
counterexample. We have obtained periodic solutions of eqs. (4) and (2), with
the solution of (4) satisfying $\|s^{(k)}\| < 1$. We would like to obtain the
corresponding periodic sequence to which the solution of (1) is
asymptotic. This sequence, which we shall call $\hat{w}^{(k)}$, is easily determined by
the equations

$$\begin{bmatrix} 1 & -b & -a \\ -a & 1 & -b \\ -b & -a & 1 \end{bmatrix}
\begin{bmatrix} \hat{w}_2^{(1)} \\ \hat{w}_2^{(2)} \\ \hat{w}_2^{(3)} \end{bmatrix}
= \begin{bmatrix} u^{(1)} \\ u^{(2)} \\ u^{(3)} \end{bmatrix},$$

$$\hat{w}_1^{(1)} = \hat{w}_2^{(3)}, \qquad \hat{w}_1^{(2)} = \hat{w}_2^{(1)}, \qquad \hat{w}_1^{(3)} = \hat{w}_2^{(2)}.$$
We have now found a true counterexample only if the values of $\hat{w}^{(k)}$
are also such that $\|\hat{w}^{(k)}\| < 1$. It is reasonable to expect that this
inequality will hold, however, since $\|\hat{w}^{(k)} - s^{(k)}\|$ is known to be
small provided that the values of $e^{(1)}$, $e^{(2)}$, $e^{(3)}$ are small, and these
terms will be small whenever the quantization is sufficiently fine [note
that $e^{(1)} = r_2^{(1)} - bp - aq - \nu^{(1)}$, and recall the equality expressed
in (25)].

Computer programs have been written which use the above process 
for constructing counterexamples, and which simulate the behavior of 
linear and nonlinear digital filters. It has been our experience, based 
upon experimentation with these programs, that counterexamples of 
the type described above can easily be found for values of the coeffi- 
cients a, b determined by points lying within the shaded region in the 
lower left-hand corner of the triangle in Fig. 2 even when the quantiza- 
tion is extremely coarse, much coarser than the quantization occurring 
in current practical digital filter realizations. We give, for example, 
the following numerical counterexample, constructed according to the 
above procedure, in which we have intentionally considered very coarse 
quantization, and have also made the task even more difficult by 
choosing $u^{(1)}$, $u^{(2)}$, $u^{(3)}$ in, obviously, a somewhat less-than-optimum
manner, with the result that $|p|$ and $|q|$ are larger than necessary.
This does, however, cause the resulting sequences $r^{(k)}$, $w^{(k)}$ to be quite
different.

We assume that the coefficient values a = —1.3, b = —0.9 have 
been specified. We also assume that 
$$Q = \{-0.9, -0.8, \ldots, 0.9, 1\}.$$

We assume that the quantization is performed by simple rounding, at
the output of the summation point, of the ideal sum to the nearest
tenth. We then have $r_{\max} = 1$, $r_{\min} = -0.9$, and therefore, choosing

$$u^{(1)} = 0, \qquad u^{(2)} = 0.6, \qquad u^{(3)} = 0.2,$$

it follows that

$$r_2^{(1)} = 0.3, \qquad r_2^{(2)} = 1, \qquad r_2^{(3)} = -0.9. \tag{29}$$

We find that the approximate values of $p$, $q$ specified by (22), (23) are:
$p = 0.711$, $q = -0.128$. Following the above outlined procedure, we
find that all of the required relationships hold. The resulting periodic
sequence $\hat{w}^{(k)}$ to which the sequence $w^{(k)}$ is asymptotic is specified by
the following approximate values:

$$\hat{w}_2^{(1)} = 0.905, \qquad \hat{w}_2^{(2)} = 0.135, \qquad \hat{w}_2^{(3)} = -0.790. \tag{30}$$

Note that quite different solutions are specified by (29) and (30). 
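
The numerical counterexample can be replayed with a short simulation. This is a sketch under stated assumptions: `f_q` is a hypothetical implementation of $f_Q$ (round to the nearest tenth, then clip to $[r_{\min}, r_{\max}]$), the nonlinear filter is started from $v^{(-1)} = r_{\max}$, $v^{(0)} = r_{\min}$, and the periodic limit of the linear model is found by simply iterating it from rest until the transient has died out rather than by solving the $3 \times 3$ system.

```python
def f_q(x, r_min=-0.9, r_max=1.0):
    """Round the ideal sum to the nearest tenth, then saturate to [r_min, r_max]."""
    return max(r_min, min(r_max, round(x * 10.0) / 10.0))


a, b = -1.3, -0.9
u = (0.0, 0.6, 0.2)              # u(1), u(2), u(3), repeated with period three
r_min, r_max = -0.9, 1.0

# Nonlinear quantized filter, eq. (17): two periods of the output sequence.
v2, v1 = r_max, r_min            # v(k-2), v(k-1)
r = []
for k in range(6):
    v = f_q(a * v1 + b * v2 + u[k % 3])
    r.append(v)
    v2, v1 = v1, v

# Linear model: iterate long enough for the transient to vanish, then record
# one period of the steady-state output.
w2, w1 = 0.0, 0.0
for k in range(3000):
    w2, w1 = w1, a * w1 + b * w2 + u[k % 3]
w_hat = []
for k in range(3):
    w = a * w1 + b * w2 + u[k % 3]
    w_hat.append(w)
    w2, w1 = w1, w

# p, q from the two linear equations (22), (23), solved by Cramer's rule.
rhs1 = 1.0 - b * r_min - a * r[0] - u[1]
rhs2 = 1.0 + a * r_max + b * r[0] + u[2]
c11, c12 = 1.0 - a * b, -(a * a + b)
c21, c22 = a + b * b, -(1.0 - a * b)
det = c11 * c22 - c12 * c21
p = (rhs1 * c22 - c12 * rhs2) / det
q = (c11 * rhs2 - c21 * rhs1) / det
```

Under these assumptions `r` repeats the values in (29), `p` and `q` come out close to the quoted approximations, and `w_hat` agrees with (30) to roughly two decimal places while staying inside the unit interval, so the two responses remain visibly different.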

V. THE FORCED RESPONSE AND INPUT SCALING 

We have shown in Section IV that the forced response of a stable 
second-order digital filter employing saturation arithmetic might not, 
for some inputs, be even asymptotically close to the filter's ideal response 
(the response of the linear filter) if the coefficients a, b are specified 
by a point lying outside the crosshatched region of the triangle in 
Fig. 2. More precisely, we have shown that this certainly happens for 
coefficient values determined by points lying within the shaded regions 
in each lower corner of that triangle (so long as certain reasonable 
assumptions hold concerning the nature of the quantization). Thus one 
concludes that, when designing a filter, it is desirable to avoid choosing 
such coefficient values. In practical applications, however, it might be 
the case that due to other considerations such a choice cannot be 
avoided. Then it is clear that the designer must be careful to impose 
appropriate restrictions on the filter's input sequence and on its initial 
conditions. He might, for example, scale the input sequence such that 
it is always small enough. The question thus arises: How small is
"small enough"? One obvious answer to this question is that the input
and the initial conditions be required to be small enough that the
response of the linear filter [described by (1)] satisfies, for some
$\delta > 0$ and all $k = 0, 1, 2, \ldots$, the inequality
$\| w^{(k)} \| \le 1 - \delta$. Then, by using
sufficiently many bits in the representation of the data, the quantization
error can always be made sufficiently small that the adder overflow
nonlinearity is not encountered.

The results contained in a paper on limit cycles (Ref. 3) can provide another
answer to this question. This answer requires consideration of only the
asymptotic nature of the input sequence, and applies to filters using a
variety of kinds of arithmetic including, in particular, saturation
arithmetic. It is clear from the analysis presented in Ref. 3 that it is
sufficient that the input sequence $u^{(k+1)}$ and the quantization error
sequence $e^{(k+1)}$ be such that the solution of (4) satisfy, for some
nonnegative integer $K$, the inequality

$$\| s^{(k)} \| + | e^{(k+1)} | < \delta, \quad \text{for } k \ge K,$$

where $\delta$ is one of the bounds specified in Theorem 1 of Ref. 3 for the
sequence $\nu^{(k+1)}$. In the case of saturation arithmetic we have

$$\delta = \max \left\{ \frac{2 - |a|}{2 + |a|}, \; \frac{1 - |b|}{1 + |b|} \right\}.$$

Then, it is clear (by Theorem 1 of Ref. 3) that the solution of (7)
satisfies $\lim_{k \to \infty} \| z^{(k)} \| = 0$, which therefore ensures proper
asymptotic behavior of the forced response of the nonlinear filter.
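
As a small worked illustration (not part of the original paper), the bound
for saturation arithmetic can be evaluated for the coefficients of the
Section IV example. The sketch below reads the bound as the larger of the
two ratios $(2 - |a|)/(2 + |a|)$ and $(1 - |b|)/(1 + |b|)$; that reading,
and the function name `saturation_delta`, are assumptions.

```python
def saturation_delta(a, b):
    """Bound delta for saturation arithmetic, read as the max of two ratios."""
    return max((2 - abs(a)) / (2 + abs(a)), (1 - abs(b)) / (1 + abs(b)))


# Coefficients of the numerical counterexample in Section IV.
delta = saturation_delta(-1.3, -0.9)
print(round(delta, 4))
```

Under this reading the first ratio is the larger one for $a = -1.3$,
$b = -0.9$, so the admissible level is set by $|a|$ rather than by $|b|$.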



VI. ACKNOWLEDGMENT 



The author is pleased to acknowledge the helpful comments of his 
colleagues J. F. Kaiser and I. W. Sandberg concerning this work. 

APPENDIX 

The proof of Theorem 1 follows. The proof uses the following well-known
result concerning the application of Liapunov's "second method"
to the study of the stability of difference equations (Refs. 5-7).

Lemma 1: Let $G$ denote a subset of the $n$-dimensional Euclidean space
$E^n$ containing the origin $\theta$. If there exist continuous functions
$W: G \to E^1$, $V: G \to E^1$, and if there exists a nonnegative integer
$K$ such that:

(i) $W(z) > 0$ for all $z \in G$, $z \ne \theta$,

(ii) $W(\theta) = 0$,

(iii) $V(z) \ge 0$ for all $z \in G$,

(iv) $\Delta V(k, z) = V(g(k, z)) - V(z) \le -W(z)$ for all $k \ge K$ and all
$z \in G$,

then each solution of the difference equation $z^{(k+1)} = g(k, z^{(k)})$ which
remains in $G$ for all $k \ge K$ approaches the origin as $k \to \infty$.

For any particular application, the effectiveness of Liapunov's method
is of course highly dependent upon the appropriateness of the particular
Liapunov function $V$ that is chosen. The quadratic form that will now
be described is quite useful for our purposes.

For any given values of the parameters $a$, $b$ which specify an
asymptotically stable linear digital filter, and with the eigenvalues of the
matrix $A$ of (1b) being denoted by $\lambda_1$, $\lambda_2$, let the Liapunov
function $V$ be defined by

$$V(z) = z^T B z, \tag{31}$$

with

$$B = \begin{bmatrix} |\lambda_1|^2 + |\lambda_2|^2 + 2\mu & -\sigma a \\ -\sigma a & 2 \end{bmatrix}, \tag{32}$$

where the values of $\sigma$ and $\mu$ are yet to be determined.

In the following lemma we determine, for any given value of $\sigma$, those
values of $\mu$ for which the matrices $B$ and $B - A^T B A$ are positive
definite.

Lemma 2: Let $\sigma$ be a given real number. Then, necessary and sufficient
conditions for the matrices $B$ and $B - A^T B A$ both to be positive definite
for values of $a$, $b$ which specify an asymptotically stable linear digital
filter are: that the values of the parameters $a$, $b$ be restricted to those
values specified by points lying within the open crosshatched region of
Fig. 5, and that, with $\mu_1 < \mu_2$ specified by

$$\mu_{1,2} = \tfrac{1}{2} \Big\{ 1 + b^2 - (1 - \sigma) a^2 - (|\lambda_1|^2 + |\lambda_2|^2) \pm \sqrt{[1 - b^2 - (1 - \sigma) a^2]^2 - a^2 [\sigma + (2 - \sigma) b]^2} \Big\}, \tag{33}$$

a value be assigned to $\mu$ such that

$$\mu_1 < \mu < \mu_2.$$

Proof: It is clear that a necessary and sufficient condition for the
matrix $B$ of (32) to be positive definite is that $\det B > 0$, which is
equivalent to the inequality

$$\mu > -\tfrac{1}{2} (|\lambda_1|^2 + |\lambda_2|^2) + \tfrac{1}{4} \sigma^2 a^2. \tag{34}$$

The matrix $B - A^T B A$, which has the form

$$\begin{bmatrix} (|\lambda_1|^2 + |\lambda_2|^2) - 2 b^2 + 2\mu & -a [\sigma + (2 - \sigma) b] \\ -a [\sigma + (2 - \sigma) b] & 2 - (|\lambda_1|^2 + |\lambda_2|^2) - 2 (1 - \sigma) a^2 - 2\mu \end{bmatrix},$$

is positive definite if and only if $\det (B - A^T B A) > 0$ and

$$\mu > b^2 - \tfrac{1}{2} (|\lambda_1|^2 + |\lambda_2|^2). \tag{35}$$

Fig. 5. Region of the a-b plane in which the matrices $B$ and $B - A^T B A$ both
may be positive definite. [Labeled points: $1/|1 - \sigma|$,
$a^* = 2/(|1 - \sigma| + 1)$, $b^* = (|1 - \sigma| - 1)/(|1 - \sigma| + 1)$.]

As is easily verified, the inequality $\det (B - A^T B A) > 0$ is equivalent to

$$\begin{aligned} -4\mu^2 &+ 4 [1 + b^2 - (1 - \sigma) a^2 - (|\lambda_1|^2 + |\lambda_2|^2)] \mu \\ &+ \{ -(|\lambda_1|^2 + |\lambda_2|^2)^2 + 2 [1 + b^2 - (1 - \sigma) a^2] (|\lambda_1|^2 + |\lambda_2|^2) \\ &\quad - a^2 [\sigma + (2 - \sigma) b]^2 - 4 b^2 [1 - a^2 (1 - \sigma)] \} > 0. \end{aligned} \tag{36}$$

We view the left-hand side of the inequality (36) as a quadratic function
in the variable $\mu$ whose coefficients depend upon the values of the
parameters $\sigma$, $a$, $b$. Clearly, for any choice of these parameter values,
(36) will not be satisfied for all large values of $|\mu|$. Thus, a necessary
and sufficient condition for the existence of real values of $\mu$ satisfying (36)
is that the quadratic function on the left-hand side of (36) have distinct
real zeros $\mu_1 < \mu_2$. Moreover, if such is the case, (36) will be satisfied
if and only if $\mu_1 < \mu < \mu_2$. The zeros $\mu_1$, $\mu_2$ are given by (33) and,
therefore, they are real and distinct if and only if

$$| 1 - b^2 - (1 - \sigma) a^2 | > | a | \cdot | \sigma + (2 - \sigma) b |. \tag{37}$$

We now prove that for any given value of $\sigma$, the values of the
parameters $a$, $b$ specified by points lying within the open triangular region
of Fig. 2, and which satisfy (37), are those (and only those) values
of $a$, $b$ specified by points lying within the open crosshatched region
of Fig. 5.

We begin by first showing that there exist no such values of $a$, $b$ for
which $1 - b^2 - (1 - \sigma) a^2 < 0$. Let us assume that this inequality holds
for some value of $\sigma$. Then, since $1 - b^2 > 0$, it follows that $\sigma < 1$.
Now, either

$$-\sigma / (2 - \sigma) \le b < 1, \tag{38}$$

or else

$$-1 < b < -\sigma / (2 - \sigma). \tag{39}$$

If (38) holds, then (37) is equivalent to

$$-1 + b^2 + (1 - \sigma) |a|^2 > |a| [\sigma + (2 - \sigma) b],$$

or

$$(b - |a| - 1) [b - (1 - \sigma) |a| + 1] > 0. \tag{40}$$

If, however, (39) holds, then (37) is equivalent to

$$(b + |a| - 1) [b + (1 - \sigma) |a| + 1] > 0. \tag{41}$$

By considering first only nonnegative values of $a$, and then considering
only nonpositive values of $a$, it is easy to use Fig. 5 and, by inspection,
determine that there exist no values of the parameters $a$, $b$ specified
by points lying within the triangular region, such that both (38) and (40)
hold. Similarly, it is easy to verify that the same is true regarding
inequalities (39) and (41).

We now assume that the parameters $\sigma$, $a$, $b$ are to be chosen such
that $1 - b^2 - (1 - \sigma) a^2 \ge 0$. Then there are three cases to consider:

If $\sigma \ge 1$, it follows that $\sigma + (2 - \sigma) b > 0$ and hence (37) is
easily shown to be equivalent to

$$(b + |a| - 1) [b - |1 - \sigma| \cdot |a| + 1] < 0. \tag{42}$$

If $\sigma < 1$ and (38) holds, then it follows that (37) is equivalent to

$$(b + |a| - 1) [b + |1 - \sigma| \cdot |a| + 1] < 0. \tag{43}$$

If $\sigma < 1$ and (39) holds, then it follows that (37) is equivalent to

$$(b - |a| - 1) [b - |1 - \sigma| \cdot |a| + 1] < 0. \tag{44}$$

By first considering only nonnegative values of $a$, and then considering
only nonpositive values of $a$, it is easy to use Fig. 5 and, by inspection,
determine that the inequality (42) is satisfied if and only if the values
of the parameters $a$, $b$ are determined by points lying within the open
crosshatched region of Fig. 5. Similarly, the inequalities (38) and (43),
or the inequalities (39) and (44), hold if and only if the values of the
parameters $a$, $b$ are determined by points lying within the open
crosshatched region of Fig. 5.

It can easily be shown that for any given value of $\sigma$, and any values
of the parameters $a$, $b$ specified by points lying within the open
crosshatched region of Fig. 5, it follows from $\mu > \mu_1$ that the
inequalities (34) and (35) also hold. We omit the details of the algebra. □
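
Lemma 2 can be spot-checked numerically. The following Python sketch is not
part of the original paper. It assumes the companion form
$A = [[0, 1], [b, a]]$ (an assumption, chosen because it reproduces the
displayed form of $B - A^T B A$), computes $\mu_1$, $\mu_2$ from (33), builds
$B$ from (32), and tests positive definiteness by Sylvester's criterion. All
function names are hypothetical. The sample point $a = -1.3$, $b = -0.9$,
$\sigma = 1$ satisfies (42); this exercises Lemma 2 only and says nothing
about the further conditions of Theorem 1.

```python
import math


def eig_sq_sum(a, b):
    """|lambda_1|^2 + |lambda_2|^2 for the roots of z^2 - a*z - b = 0."""
    if a * a + 4 * b < 0:        # complex-conjugate pair: |lambda|^2 = -b
        return -2.0 * b
    return a * a + 2.0 * b       # real roots: (l1 + l2)^2 - 2*l1*l2


def mu_interval(a, b, sigma):
    """Zeros mu_1 < mu_2 from (33); None when (37) fails."""
    disc = (1 - b * b - (1 - sigma) * a * a) ** 2 \
        - a * a * (sigma + (2 - sigma) * b) ** 2
    if disc <= 0:
        return None
    mid = 1 + b * b - (1 - sigma) * a * a - eig_sq_sum(a, b)
    root = math.sqrt(disc)
    return 0.5 * (mid - root), 0.5 * (mid + root)


def matrices(a, b, sigma, mu):
    """Return B of (32) and D = B - A^T B A by direct multiplication."""
    B = [[eig_sq_sum(a, b) + 2 * mu, -sigma * a], [-sigma * a, 2.0]]
    A = [[0.0, 1.0], [b, a]]     # assumed companion form
    BA = [[sum(B[i][k] * A[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    AtBA = [[sum(A[k][i] * BA[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    D = [[B[i][j] - AtBA[i][j] for j in range(2)] for i in range(2)]
    return B, D


def is_pos_def(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0


a, b, sigma = -1.3, -0.9, 1.0
mu1, mu2 = mu_interval(a, b, sigma)
B_mid, D_mid = matrices(a, b, sigma, 0.5 * (mu1 + mu2))   # mu inside (mu1, mu2)
B_out, D_out = matrices(a, b, sigma, mu2 + 0.01)          # mu just outside
print(mu1, mu2, is_pos_def(B_mid), is_pos_def(D_mid), is_pos_def(D_out))
```

Multiplying the matrices out directly, rather than coding the closed form of
$B - A^T B A$, gives an independent check on the algebra of (33) through (36).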

Proof of Theorem 1: Let the Liapunov function $V$ be defined for all
$z \in G = E^2$ by (31) and (32) with the values of $\sigma$, $a$, $b$, $\mu$
assumed to be such that both of the matrices $B$ and $B - A^T B A$ are
positive definite. It is clear that the equations

$$z^T (B - A^T B A) z = c, \quad c > 0 \tag{45}$$

define a family of concentric ellipses, centered at the origin $\theta$ in the
x-y plane [where $z = (x, y)^T$]. The origin also lies between the two
parallel straight lines $bx + ay = \pm (1 - \epsilon)$, each of which is tangent
to exactly one (in fact, the same one) of the ellipses (45). Thus, there
is a unique value of $c^* > 0$ such that

$$c^* = \min \{ z^T (B - A^T B A) z : bx + ay = \pm (1 - \epsilon) \}.$$

Let the function $W$ be defined for all $z \in G = E^2$ by

$$W(z) = \min \{ z^T (B - A^T B A) z, \; c^* \}.$$

Thus, $W(z)$ is defined by the positive definite quadratic form
$z^T (B - A^T B A) z$ for all points lying within the ellipse
$z^T (B - A^T B A) z = c^*$, and $W(z)$ is defined by $W(z) = c^*$ for all
other points in the x-y plane.

It is clear that for each value of $\nu^{(k+1)}$ for which (8) holds, the points
of the x-y plane determined by $bx + ay + \nu^{(k+1)} \ge 1$ lie on the opposite
side of the line $bx + ay = 1 - \epsilon$ from the ellipse
$z^T (B - A^T B A) z = c^*$. The situation is similar regarding the points of
the x-y plane determined by $bx + ay + \nu^{(k+1)} \le -1$ and the line
$bx + ay = -(1 - \epsilon)$. See Fig. 6.

With 



g(k, z) = F(Az +

it follows that

$$\Delta V(k, z) = V(g(k, z)) - V(z) = [g(k, z)]^T B \, g(k, z) - z^T B z.$$

Fig. 6. Location of the ellipse $z^T (B - A^T B A) z = c^*$.



Thus, whenever $| bx + ay + \nu^{(k+1)} | \le 1$, we have

$$\Delta V(k, z) = -z^T (B - A^T B A) z \le -W(z).$$

When $bx + ay + \nu^{(k+1)} > 1$,

$$\begin{aligned} \Delta V(k, z) = -\{ & [(|\lambda_1|^2 + |\lambda_2|^2) + 2\mu] x^2 - 2 \sigma a x y + [2 - (|\lambda_1|^2 + |\lambda_2|^2) - 2\mu] y^2 \\ & + 2 \sigma a (1 - \nu^{(k+1)}) y - 2 (1 - \nu^{(k+1)})^2 \}; \end{aligned} \tag{46}$$

and when $bx + ay + \nu^{(k+1)} < -1$,

$$\begin{aligned} \Delta V(k, z) = -\{ & [(|\lambda_1|^2 + |\lambda_2|^2) + 2\mu] x^2 - 2 \sigma a x y + [2 - (|\lambda_1|^2 + |\lambda_2|^2) - 2\mu] y^2 \\ & - 2 \sigma a (1 + \nu^{(k+1)}) y - 2 (1 + \nu^{(k+1)})^2 \}. \end{aligned} \tag{47}$$

It is an elementary result of analytic geometry that a general second-degree
equation of the form $ax^2 + bxy + cy^2 + dx + ey + f = 0$
represents an ellipse if and only if $b^2 - 4ac < 0$. It follows that, if we
consider the constant-$\Delta V$ loci in the $bx + ay + \nu^{(k+1)} > 1$ region,
and in the $bx + ay + \nu^{(k+1)} < -1$ region of the x-y plane, a necessary
and sufficient condition for these loci to be arcs of concentric ellipses is:

$$4\mu^2 - 4 [1 - (|\lambda_1|^2 + |\lambda_2|^2)] \mu + [\sigma^2 a^2 - 2 (|\lambda_1|^2 + |\lambda_2|^2) + (|\lambda_1|^2 + |\lambda_2|^2)^2] < 0. \tag{48}$$

Furthermore, since $\Delta V(k, z)$ is continuous in $z$, and since the values of
$\Delta V(k, z)$ along the lines $bx + ay + \nu^{(k+1)} = \pm 1$ are given by
$\Delta V(k, z) = -z^T (B - A^T B A) z$, with $B - A^T B A$ a positive definite
matrix, it is clear that when (48) is satisfied, the constant-$\Delta V$ curves
specified by (46) (temporarily extending the domain of definition of that
function to the entire x-y plane) are of the type shown in either Fig. 7a or
Fig. 7b; that is, the line $bx + ay + \nu^{(k+1)} = 1$ intersects only certain
constant-$\Delta V$ curves, in particular, only certain such curves for which
the value of $\Delta V$ is negative. Thus, the center of the ellipses is situated
to one side or the other of the line $bx + ay + \nu^{(k+1)} = 1$ in such a manner
that the constant-$\Delta V$ ellipses for which $\Delta V$ is positive are not
intersected by the line. Considering, however, that when $\Delta V(k, z)$ of (46)
is evaluated at $z = \theta$ its value is positive, it is clear that Fig. 7b is
impossible. Thus [applying exactly the same reasoning to the
constant-$\Delta V$ curves defined by (47)], it follows that whenever the
inequality (48) is satisfied, the function $\Delta V(k, z)$ achieves its maximum
for $bx + ay + \nu^{(k+1)} \ge 1$ on the line $bx + ay + \nu^{(k+1)} = 1$, and
similarly for the behavior of $\Delta V(k, z)$ in the
$bx + ay + \nu^{(k+1)} \le -1$ region of the x-y plane. It follows, therefore,
that there exists $c' \ge c^* > 0$ such that for $bx + ay + \nu^{(k+1)} \ge 1$,

$$\Delta V(k, z) \le -c' \le -c^* = -W(z). \tag{49}$$

Similarly, there exists $c'' \ge c^* > 0$ such that for
$bx + ay + \nu^{(k+1)} \le -1$,

$$\Delta V(k, z) \le -c'' \le -c^* = -W(z). \tag{50}$$

Fig. 7. Possible shape of constant-$\Delta V$ curves defined by equation (46);
panels (a) and (b).

We have shown that, with the functions $V$, $W$ defined as specified
above, the hypotheses of Lemma 1 are satisfied. Thus, the solution
of (7) satisfies $\lim_{k \to \infty} \| z^{(k)} \| = 0$ provided that the values
of $\sigma$, $a$, $b$, $\mu$ are such that $B$ and $B - A^T B A$ are positive
definite, and provided that (48) holds.

We view the left-hand side of the inequality (48) as a quadratic
function in the variable $\mu$ whose coefficient values depend upon the
values of the parameters $\sigma$, $a$, $b$. Clearly, for any choice of these
parameter values (48) will not hold for all large values of $|\mu|$. Thus, a
necessary and sufficient condition for the existence of real values of $\mu$
satisfying (48) is that the quadratic function on the left-hand side of (48)
has distinct real zeros $\hat{\mu}_1 < \hat{\mu}_2$; moreover (48) will be
satisfied if and only if $\hat{\mu}_1 < \mu < \hat{\mu}_2$. The zeros
$\hat{\mu}_1$, $\hat{\mu}_2$ are given by

$$\hat{\mu}_{1,2} = \tfrac{1}{2} \big[ 1 - (|\lambda_1|^2 + |\lambda_2|^2) \pm \sqrt{1 - \sigma^2 a^2} \big]. \tag{51}$$

They are real and distinct if and only if the inequality (10) holds.
According to Lemma 2, for any given value of $\sigma$ the matrices $B$ and
$B - A^T B A$ are positive definite for values of $a$, $b$ that are specified by
some point lying within the open triangular region of Fig. 2 if and
only if $\mu_1 < \mu < \mu_2$, where $\mu_1$, $\mu_2$ are specified by (33). Thus,
assuming that $\sigma$, $a$, $b$ satisfy (10) and (11), there exists a value of
$\mu$ such that $B$ and $B - A^T B A$ are positive definite and such that (48)
holds if and only if the open intervals $(\mu_1, \mu_2)$ and
$(\hat{\mu}_1, \hat{\mu}_2)$ overlap. That is, if and only if
$\mu_1 < \hat{\mu}_2$ and $\hat{\mu}_1 < \mu_2$. Using (33) and (51), these last two
inequalities are easily shown to be equivalent to (12). □
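
The interval-overlap argument that closes the proof can be illustrated
numerically. The following Python sketch is not part of the original paper;
it evaluates the interval of (33) and the interval of (51) and intersects
them. It does not check that $(a, b)$ lies in the stability triangle, nor
conditions (10) through (12) themselves, which are stated outside this
appendix. The function names and the sample point $a = 0.5$, $b = -0.5$,
$\sigma = 1$ are hypothetical choices made for illustration.

```python
import math


def eig_sq_sum(a, b):
    """|lambda_1|^2 + |lambda_2|^2 for the roots of z^2 - a*z - b = 0."""
    return -2.0 * b if a * a + 4 * b < 0 else a * a + 2.0 * b


def lemma2_interval(a, b, sigma):
    """Open interval (mu_1, mu_2) from (33); None when (37) fails."""
    disc = (1 - b * b - (1 - sigma) * a * a) ** 2 \
        - a * a * (sigma + (2 - sigma) * b) ** 2
    if disc <= 0:
        return None
    mid = 1 + b * b - (1 - sigma) * a * a - eig_sq_sum(a, b)
    return 0.5 * (mid - math.sqrt(disc)), 0.5 * (mid + math.sqrt(disc))


def ellipse_interval(a, b, sigma):
    """Open interval between the zeros of (48), as given by (51)."""
    s = 1 - (sigma * a) ** 2
    if s <= 0:                   # zeros not real and distinct
        return None
    mid = 1 - eig_sq_sum(a, b)
    return 0.5 * (mid - math.sqrt(s)), 0.5 * (mid + math.sqrt(s))


def admissible_mu(a, b, sigma):
    """Intersection of the two open intervals, or None if it is empty."""
    first = lemma2_interval(a, b, sigma)
    second = ellipse_interval(a, b, sigma)
    if first is None or second is None:
        return None
    lo, hi = max(first[0], second[0]), min(first[1], second[1])
    return (lo, hi) if lo < hi else None


good = admissible_mu(0.5, -0.5, 1.0)    # hypothetical well-behaved sample
bad = admissible_mu(-1.3, -0.9, 1.0)    # Section IV coefficients, sigma = 1
print(good, bad)
```

For the Section IV coefficients with $\sigma = 1$ the zeros of (48) are not
real, so no admissible $\mu$ is returned, whereas the sample point inside the
triangle yields a nonempty interval of admissible values.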
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